Reproducibility in R

1 Gapminder Data

We are going to use some of the data on https://www.gapminder.org for this workshop. Their statement says: Our mission is to fight devastating ignorance with a fact-based worldview everyone can understand. We will use a collection of data describing the life expectancy by year and country of birth – https://www.gapminder.org/data/documentation/gd004/

2 Load packages and import data

library('here')
library('tidyr')
library('ggplot2')
library('plotly')
library('quarto')
library('dplyr')
library('pwr')
library('pdftools')
library('stringr')
library('patchwork')

We have already imposed some structure on the contents of this repository. The data folder has three sub-folders – raw, intermediate, and processed. We can try to import the raw data as follows:

Run here to figure out current path:

here()
[1] "/Users/jwalla12/repos/reproducibility_r"

Tell here the path to the current document:

i_am("notebooks/quarto/book.qmd")
here() starts at /Users/jwalla12/repos/reproducibility_r

here will give us the top level of the project:

here()
[1] "/Users/jwalla12/repos/reproducibility_r"

Which we can use to build the full path to the file:

gm_lex_raw <- read.csv(here('data/raw/GM_lex_by_country.csv'))

This still works even if we set setwd()

setwd('/Users/jwalla12/repos/reproducibility_r/notebooks/quarto')
getwd()
[1] "/Users/jwalla12/repos/reproducibility_r/notebooks/quarto"
here()
[1] "/Users/jwalla12/repos/reproducibility_r"

We can use the head function to look at the first few lines of the data we just imported:

head(gm_lex_raw)
  geo        name time Life.expectancy
1 afg Afghanistan 1800           28.21
2 afg Afghanistan 1801           28.20
3 afg Afghanistan 1802           28.19
4 afg Afghanistan 1803           28.18
5 afg Afghanistan 1804           28.17
6 afg Afghanistan 1805           28.16

And we can use $ to get or set column data:

head(gm_lex_raw$Life.expectancy)
[1] 28.21 28.20 28.19 28.18 28.17 28.16
gm_lex_raw$some_data <- 'new_column'
head(gm_lex_raw)
  geo        name time Life.expectancy  some_data
1 afg Afghanistan 1800           28.21 new_column
2 afg Afghanistan 1801           28.20 new_column
3 afg Afghanistan 1802           28.19 new_column
4 afg Afghanistan 1803           28.18 new_column
5 afg Afghanistan 1804           28.17 new_column
6 afg Afghanistan 1805           28.16 new_column

We can remove a column by setting it to NULL

gm_lex_raw$some_data <- NULL
head(gm_lex_raw)
  geo        name time Life.expectancy
1 afg Afghanistan 1800           28.21
2 afg Afghanistan 1801           28.20
3 afg Afghanistan 1802           28.19
4 afg Afghanistan 1803           28.18
5 afg Afghanistan 1804           28.17
6 afg Afghanistan 1805           28.16

3 Filtering and plotting data

When analyzing your data in a program like Excel, you might want to filter data to look at a particular category of interest. We can do this in R as well. Let’s look at the Canada data using the filter function from dplyr. I am using the dplyr::filter notation to specify that I want to use the filter from dplyr – not some other package that also has a function called filter.

c1 <- dplyr::filter(gm_lex_raw, name == 'Canada')

We can also reference the gm_lex_raw object by its rows and columns using [rows, columns]

#first two rows
gm_lex_raw[1:2,]
  geo        name time Life.expectancy
1 afg Afghanistan 1800           28.21
2 afg Afghanistan 1801           28.20
#first two rows and first three columns:
gm_lex_raw[1:2,1:3]
  geo        name time
1 afg Afghanistan 1800
2 afg Afghanistan 1801

We can use this notation to filter the gm_lex_raw data, similarly to how we filtered using dplyr:

c2 <- gm_lex_raw[gm_lex_raw$name %in% 'Canada',]

Although they aren’t informative variable names and probably not the best choice, R will still let you name your variables c1 and c2. There are some variable names that R disallows:

print(?Reserved)

# Reserved {base}   R Documentation
# Reserved Words in R
# Description
# The reserved words in R's parser are
# 
# if else repeat while function for in next break
# 
# TRUE FALSE NULL Inf NaN NA NA_integer_ NA_real_ NA_complex_ NA_character_
# 
# ... and ..1, ..2 etc, which are used to refer to arguments passed down from a calling function, see ....
# 
# Details
# Reserved words outside quotes are always parsed to be references to the objects linked to in the ‘Description’, and hence they are not allowed as syntactic names (see make.names). They are allowed as non-syntactic names, e.g. inside backtick quotes.

We will make a few figures using ggplot2. The general syntax of a ggplot is as follows:

ggplot(data=<DATA>) +
    <GEOM_FUNCTION>(mapping=aes(<MAPPINGS>)

We can make an example plot:

data(mpg)
ggplot(data=mpg) + 
  geom_point(mapping=aes(x=displ,y=hwy))

We can also specify the mapping = aes inside the call to ggplot instead of the call to the geom_function, like this:

ggplot(data=mpg, mapping=aes(x=displ,y=hwy)) + 
  geom_point()

The ggplot() function contains global mapping, while each geom has a local mapping.

Global mapping of displ and hwy creates the x and y axes:

ggplot(data=mpg, mapping=aes(x=displ,y=hwy))

Mapping color to class for point geom while using global x and y mappings:

ggplot(data=mpg, mapping=aes(x=displ,y=hwy)) + 
  geom_point(mapping=aes(color=class))

geom_smooth doesn’t need any mapping arguments if using global mapping:

ggplot(data=mpg, mapping=aes(x=displ,y=hwy)) +
    geom_point(mapping=aes(color=class))+
    geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

We can also make a second geom_smooth that will use the same x and y mapping, but will plot data coming from no_2seaters object:

no_2seaters <- filter(mpg, class != "2seater")

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth() +
  geom_smooth(data = no_2seaters)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

We can also add titles to our figures. For example, let’s make a quick plot of the Canada data, using the ggtitle argument to add a title indicating that it is data from Canada.

can_plot <- ggplot(data = c1) +
   geom_line(mapping = aes(x = time, y = Life.expectancy)) + 
  ggtitle('Canada')
can_plot

We could also use labs to set some titles (see ?labs for more details)

ggplot(data = c1) +
   geom_line(mapping = aes(x = time, y = Life.expectancy)) + 
  labs(title = 'Canada', subtitle = 'this is a subtitle')

Let’s look at a table of the Canada data between 1900 and 1950 to see more information about the dip we see:

dplyr::filter(c1, time >= 1915 & time < 1925)
   geo   name time Life.expectancy
1  can Canada 1915           54.41
2  can Canada 1916           54.87
3  can Canada 1917           55.32
4  can Canada 1918           47.26
5  can Canada 1919           56.23
6  can Canada 1920           56.69
7  can Canada 1921           57.14
8  can Canada 1922           57.14
9  can Canada 1923           57.14
10 can Canada 1924           58.96

in base R we could do something like this, where we select rows in can_data where the time column is between 1915 and 1925.

c1[c1$time %in% 1915:1925,]
    geo   name time Life.expectancy
116 can Canada 1915           54.41
117 can Canada 1916           54.87
118 can Canada 1917           55.32
119 can Canada 1918           47.26
120 can Canada 1919           56.23
121 can Canada 1920           56.69
122 can Canada 1921           57.14
123 can Canada 1922           57.14
124 can Canada 1923           57.14
125 can Canada 1924           58.96
126 can Canada 1925           59.44

We could also use the ggplotly function in the plotly package to make an interactive plot to see which year has the dip:

ggplotly(can_plot)

Look like people born in 1918 have a lower life expectancy, this study indicates this is due to the 1918 flu epidemic: doi: 10.1111/j.1728-4457.2000.00565.x

Note

Looks like people born in 1918 have a lower life expectancy, this study indicates this is due to the 1918 flu epidemic: doi: 10.1111/j.1728-4457.2000.00565.x. We can also see that the time axis goes to year 2100. This is because after 2017, the data includes life expectancy estimates rather than using real data. Take a few minutes to re-name the c1 and c2 variables something more informative and create a new figure that plots the Canada data but omits the year 2017 and after.

ggplot(data = dplyr::filter(gm_lex_raw, name == 'Canada') %>% 
         dplyr::filter(time <= 2016), 
       aes(x = time, y = Life.expectancy)) +
  geom_line() + 
  ggtitle('Canada')

4 Functions

One of the central tenets of reproducible analyses is the DRY principle – Don’t Repeat Yourself. If you find that you are repeating the same lines of code, it might be time to write a function.

Functions take the following basic format:

myfunction <- function(argument_name){
  stuff <- "this is the body of the function
  (it contains statements that use `argument_name`
  to do things and make stuff)"
  return(stuff)}

More formally, R functions are broken up into 3 pieces:

formals()- the list of arguments

body() - code inside the function

environment() - how the function finds the values associated with function names

Here’s an example of a function called roll() that rolls any number of 6-sided dice:

roll <- function(number_of_dice){
    rolled_dice <- sample(
        x = 6, 
        size = number_of_dice, 
        replace = TRUE)
    return(rolled_dice)}
  • The built-in R function sample() is nested inside our roll() function.

  • roll() uses the argument number_of_dice as the size, x is the number of sides on the die, which we have hard-coded as 6, and replace = TRUE means that we are sampling the space of all potential die roll outcomes with replacement.

  • Lastly, we tell the function what it should return (rolled_dice).

To call that function and print the output:

roll(number_of_dice = 10)
 [1] 5 3 5 3 2 4 4 5 6 2

Lets look at the formals()

formals(roll)
$number_of_dice

What about body()?

body(roll)
{
    rolled_dice <- sample(x = 6, size = number_of_dice, replace = TRUE)
    return(rolled_dice)
}

What about environment()?

environment(roll)
<environment: R_GlobalEnv>
Note

Now that you know how to write functions, write a function that will plot the life expectancy of any country and include the name of the country as the title (using ggtitle).

plot_by_country <- function(input_table, country){
  country_data <- input_table[input_table$name %in% country,]
  country_plot <- ggplot(data = country_data, aes(x = time, y = Life.expectancy)) + 
    geom_line() + ggtitle(country)
  return(country_plot)}
plot_by_country(gm_lex_raw, 'Afghanistan')

plot_by_country2 <- function(input_table, country){
  country_data <- input_table %>% dplyr::filter(name == country) 
  country_plot <- ggplot(data = country_data, aes(x = time, y = Life.expectancy)) + 
    geom_line() + ggtitle(country)
  return(country_plot)}
plot_by_country2(gm_lex_raw, 'Afghanistan')

5 lapply

What if you want to run your function on a list of things? We can use lapply, which will apply a function (FUN) to a list (X). It will return a list of the same length as X, which is the result of applying FUN to the elements of X. The usage is: lapply(X, FUN)

We can try an example with the life expectancy data. First, make a list of 3 countries:

country_list <- c('Canada', 'United States','Mexico')

Then we can try running our plotting function on the list, specifying the other function argument input_table:

plot_list <- lapply(country_list, plot_by_country, input_table = gm_lex_raw)

We could also set it up this way:

plot_list_2 <- lapply(country_list, function(x) plot_by_country(input_table = gm_lex_raw, country = x))

Or if it is a function we only want to use once, we might want to set things up as an anonymous function:

plot_list_3 <- lapply(country_list, function(z) {
  ggplot(data = gm_lex_raw[gm_lex_raw$name %in% z,],
         aes(x = time, y = Life.expectancy)) +
    geom_line() +
    ggtitle(z)
})

using lapply() with power analysis

Here’s an example where we are trying to run a power analysis.

The arguments for pwr.anova.test are as follows:

k = number of groups

n = number of obervations (per group)

f = effect size

sig.level = Significance level (Type I error probability

power = Power of test (1 minus Type II error probability)

You can also see the description and arguments for the pwr.anova.test function by typing ?pwr.anova.test in the console.

Let’s say we have 5 groups and want a power of 80% with effect size of 0.25 and a significance level of 0.05.

a <- pwr.anova.test(k=5,
               f=.25,
               sig.level=.05,
               power=.8)
a

     Balanced one-way analysis of variance power calculation 

              k = 5
              n = 39.1534
              f = 0.25
      sig.level = 0.05
          power = 0.8

NOTE: n is number in each group

What if we have a different number of participants?

30 participants

b <- pwr.anova.test(k=5,
               n=30,
               f=.25,
               sig.level=.05)
bpower <- b$power

35 participants

d=pwr.anova.test(k=5,
                 n=35,
                 f=.25,
                 sig.level=.05)
dpower <- d$power

25 participants

e=pwr.anova.test(k=5,
                 n=25,
                 f=.25,
                 sig.level=.05)
epower <- e$power

40 participants

f=pwr.anova.test(k=5,
                 n=40,
                 f=.25,
                 sig.level=.05)

Then we can make a few vectors of our results, including the power and n number of participants. Then we use those vectors to make a plot:

power <- c(.8,bpower,dpower,epower,f$power)
n <- c(a$n,30,35,25,f$n)
plot(n, power, type="b")

Note that this plot looks pretty terrible – we haven’t sorted the number of participants, so R is trying to connect the data points in the order you’ve provided them. We are also repeating ourselves quite a bit.

Here’s another example of how to adjust the power – what effect size can we detect at 80% power with 30 participants?

c = pwr.anova.test(
  k = 5,
  n = 30,
  sig.level = .05,
  power = .8)
c

     Balanced one-way analysis of variance power calculation 

              k = 5
              n = 30
              f = 0.2867269
      sig.level = 0.05
          power = 0.8

NOTE: n is number in each group
c$power
[1] 0.8
Note

Now that you know how to write functions and use lapply(), write an lapply() statement that will make a plot with sample size between 2 and 45 on the x-axis and the power for effect size 0.25 on the y-axis. Use a significance level .05 with 5 groups.

n <- 2:45
power <- lapply(n, function(x) pwr.anova.test(k=5,n=x,f=.25,sig.level=.05)$power)
plot(n,power,type="b")

6 Control Flow

Sometimes we may want to make some decisions about our analysis based on the data – if...else statements are one way to achieve this. The general format of an if..else statement is:

returned_vector <- ifelse(test_expression, x, y)

We can try an example – first generate a list of numbers, 0-10 and then write an ifelse statement that will divide each number by two and if there’s no remainder, then print even, otherwise print odd.

numbers <- 0:10
ifelse(numbers %% 2 == 0,"even","odd")
 [1] "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd" 
[11] "even"

We could also have it return the contents of number if there’s a remainder:

ifelse(numbers %% 2 == 0,"even",numbers)
 [1] "even" "1"    "even" "3"    "even" "5"    "even" "7"    "even" "9"   
[11] "even"
Note

Now that you know how to use ifelse(), add a new column gm_lex_raw to indicate whether or not the data for a given row is a prediction or real data (remember that starting with the year 2017, the data includes life expectancy estimates rather than actual data). As a stretch goal, you can also create a new line plot showing the life expectancy for Spain, where the line is colored based on whether or not the data contains estimates or not.

gm_lex_raw$data_source <- ifelse(gm_lex_raw$time >= 2016, 'estimate', 'not_estimate')
ggplot(
  data = gm_lex_raw %>% dplyr::filter(name == 'Spain'),
  aes(x = time, y = Life.expectancy, color = data_source)
) +
  geom_line() +
  ggtitle('Spain')

7 PDF parsing

Many scientific articles and documents are in PDF format. This isn’t the easiest format for reproducible analysis, but here are packages to parse text information from PDFs.

I’ve saved a PDF of US GDP per year (also from Gap Minder). Let’s use pdftools to import it. We will use the pdf_ocr_text function, which uses Optical Character Recognition to read the PDF.

gm_gdp_us <- pdf_ocr_text(here('data/raw/GM_gdp_US.pdf'))
Converting page 1 to GM_gdp_US_1.png... done!
Converting page 2 to GM_gdp_US_2.png... done!
Converting page 3 to GM_gdp_US_3.png... done!
Converting page 4 to GM_gdp_US_4.png... done!
Converting page 5 to GM_gdp_US_5.png... done!
Converting page 6 to GM_gdp_US_6.png... done!
head(gm_gdp_us[[1]])
[1] "geo name time Income_per_person GDP_total GDP_per_capita_growth\nusa United States 1800 2959 17753987800 1.82\nusa United States 1801 3013 18419589752 1.82\nusa United States 1802 3064 19085435487 1.69\nusa United States 1803 3011 19115338971 -1.71\nusa United States 1804 3008 19458503833 -0.1\nusa United States 1805 3073 20251048177 2.14\nusa United States 1806 3128 21006434471 1.8\nusa United States 1807 3159 21619557565 1\nusa United States 1808 2937/7 20476849053 -7.05\nusa United States 1809 3083 21905771676 4.99\nusa United States 1810 3168 23137219513 2.76\nusa United States 1811 3232 24252290785 2.03\nusa United States 1812 3176 24541848249 -1.76\nusa United States 1813 3212 25628673616 1.13\nusa United States 1814 3273 26969646728 1.91\nusa United States 1815 3259 27735002649 -0.41\nusa United States 1816 3128 27484206474 -4.03\nusa United States 1817 3125 28353681313 -0.1\nusa United States 1818 3132 29348647857 0.24\nusa United States 1819 3128 30237986414 -0.14\nusa United States 1820 3108 30980681930 -0.62\nusa United States 1821 3156 32406579604 1.54\nusa United States 1822 3244 34291619050 2.79\nusa United States 1823 3209 34884680185 -1.11\nusa United States 1824 3297 36870813981 2./5\nusa United States 1825 3368 38753033550 2.18\nusa United States 1826 3392 40146913757 0.71\nusa United States 1827 3406 41461516592 0.4\nusa United States 1828 3406 42650068401 0\nusa United States 1829 3294 42422676146 -3,29\nusa United States 1830 3533 46793256716 7.26\nusa United States 1831 3750 51065293925 6.13\nusa United States 1832 3886 54403832944 3.63\nusa United States 1833 4022 57882073073 3.5\nusa United States 1834 3830 56669999467 -4.76\nusa United States 1835 3975 60458930371 3.78\nusa United States 1836 4034 63060286710 1.47\nusa United States 1837 3906 62781668297 -3.15\nusa United States 1838 3871 63944212742 -0.92\nusa United States 1839 4059 68966993518 4.86\nusa United States 1840 3859 67469495254 -4.93\nusa United States 1841 3781 68066704282 -2.01\n"

This looks very messy – we can see lots of new line characters (\n) spread throughout the string. We can use str_split from stringr to split up the characters by the \n character.

gm_gdp_us_split <- str_split(string = gm_gdp_us, pattern = "\n")

We’ve got a nested list – use unlist to flatten it.

gm_gdp_us_split_unlist <- unlist(gm_gdp_us_split)

Use read.table to convert into a dataframe, setting row.names = NULL will number the rows (instead of trying to use the column row.names to name them).

gm_gdp_us_imported <- read.table(text = gm_gdp_us_split_unlist, row.names = NULL)
head(gm_gdp_us_imported)
  row.names    geo   name time Income_per_person   GDP_total
1       usa United States 1800              2959 17753987800
2       usa United States 1801              3013 18419589752
3       usa United States 1802              3064 19085435487
4       usa United States 1803              3011 19115338971
5       usa United States 1804              3008 19458503833
6       usa United States 1805              3073 20251048177
  GDP_per_capita_growth
1                  1.82
2                  1.82
3                  1.69
4                 -1.71
5                  -0.1
6                  2.14

Our column names have gotten a little broken up – let’s try to fix them using paste , which will convert vectors to characters and then concatenate them.

gm_gdp_us_imported$name_fixed <- paste(gm_gdp_us_imported$geo, gm_gdp_us_imported$name, sep = " ")

Then we can drop geo and name, rename the row.names column as geo, and drop the row.names column.

gm_gdp_us_imported$name <- NULL
gm_gdp_us_imported$geo <- NULL
gm_gdp_us_imported$geo <- gm_gdp_us_imported$row.names
gm_gdp_us_imported$row.names <- NULL

8 Joins

Another potential use case where R can make your analysis pipeline more reproducible is the situation where you have two tables of data that you want to join based on some common identifier – perhaps using an Excel VLOOKUP function. You can also do these using mutating joins from dplyr.

inner_join(): includes only rows in both x and y.

left_join(): includes all rows in x.

right_join(): includes all rows in y.

full_join(): includes all rows in either x or y.

Let’s look at some examples using band_members, band_instruments and band_instruments2

data("band_instruments")
data("band_instruments2")
data("band_members")

head(band_instruments)
# A tibble: 3 × 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar
head(band_instruments2)
# A tibble: 3 × 2
  artist plays 
  <chr>  <chr> 
1 John   guitar
2 Paul   bass  
3 Keith  guitar
head(band_members)
# A tibble: 3 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles

Let’s say we want to join band_members and band_instruments:

inner_example <- inner_join(band_members, band_instruments)
Joining with `by = join_by(name)`
head(inner_example)
# A tibble: 2 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  

Note that R will tell you which variable it used for the join – here it was able to detect that we had two columns called name and it automatically used that column for the join.

If you want to use band_instruments2 instead of band_instruments, you have to specify a character vector of variable names to join by. If we are joining x and y, by = c("a", "b") joins x$a to y$a and x$b to y$b. If variable names differ between x and y, use a named character vector like by = c("x_a" = "y_a", "x_b" = "y_b"). For example:

inner_example2 <- inner_join(band_members, band_instruments2, by = c('name' = 'artist'))
head(inner_example2)
# A tibble: 2 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  

If we swap the order of the arguments, we have to swap the order of the column specification and the name column is now artist in the output.

inner_example3 <- inner_join(band_instruments2, band_members, by = c('artist' = 'name'))
head(inner_example3)
# A tibble: 2 × 3
  artist plays  band   
  <chr>  <chr>  <chr>  
1 John   guitar Beatles
2 Paul   bass   Beatles

We can also run an example full_join:

full_example <- full_join(band_members, band_instruments)
Joining with `by = join_by(name)`
full_example
# A tibble: 4 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  
4 Keith <NA>    guitar

And right_join or left_join:

right_example <- right_join(band_members, band_instruments)
Joining with `by = join_by(name)`
right_example
# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  
3 Keith <NA>    guitar
left_example <- left_join(band_members, band_instruments)
Joining with `by = join_by(name)`
left_example
# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  
Note

Now that you know how to use mutating joins, make a new object that joins the life expectancy data (gm_lex_raw) and the US GP data (gm_gdp_us_imported).

gm_data_joined <- inner_join(gm_lex_raw, gm_gdp_us_imported, by = c('geo' = 'geo', 'name' = 'name_fixed', 'time' = 'time'))